YET ANOTHER ZETA FUNCTION AND LEARNING

IGOR RIVIN

Abstract. We analyze completely the convergence speed of the batch
learning algorithm, and compare its speed to that of the memoryless
learning algorithm and of learning with memory (as analyzed in
[KR2001b]). We show that the batch learning algorithm is never worse
than the memoryless learning algorithm (at least asymptotically). Its
performance vis-a-vis learning with full memory is less clearcut, and
depends on certain probabilistic assumptions. These results necessitate
the introduction of the moment zeta function of a probability
distribution and the study of some of its properties.

Introduction 

The original motivation for the work in this paper was provided by
research in learning theory, specifically in various models of language
acquisition (see, for example, [KNN2001], [NKN2001], [KN2001]). In the
paper [KR2001b], we studied the speed of convergence of the memoryless
learner algorithm, and also of learning with full memory. Since the
batch learning algorithm is both widely known and believed by learning
theorists to have superior speed (at the cost of memory) to both of the
above methods, it seemed natural to analyze its behavior under the same
set of assumptions, in order to bring the analysis in [KR2001a] and
[KR2001b] to a sort of closure. It should be noted that the detailed
analysis of the batch learning algorithm is performed under the
assumption of independence, which was not explicitly present in our
previous work. For the impatient reader we state our main result
(Theorem 8.1) immediately (the reader can compare it with the results
on the memoryless learning algorithm and learning with full memory, as
summarized in Theorem 2.1).



Theorem A. Let $N_\Delta$ be the number of steps it takes for the student
(with probability 1) to have probability $1 - \Delta$ of learning the
concept using the batch learner algorithm. Then we have the following
estimates for $N_\Delta$:

• If the distribution of overlaps is uniform, or more generally, the
density function satisfies $f(1 - x) = c + O(x^\delta)$ as $x \to 0$,
with $\delta, c > 0$, then $N_\Delta = |\log \Delta| \, \Theta(n)$;

• If the probability density function $f(1 - x)$ is asymptotic to
$x^\beta + O(x^{\beta + \delta})$, $\delta, \beta > 0$, as $x$
approaches 0, then we have
$N_\Delta = |\log \Delta| \, \Theta(n^{1/(1+\beta)})$;

• If the asymptotic behavior is as above, but $-1/2 < \beta < 0$, then
$N_\Delta = |\log \Delta| \, \Theta(n^{1/(1+\beta)})$.

The plan of the paper is as follows: in this Introduction we recall the
learning algorithms we study; in Section 1 we define our mathematical
model; in Section 2 we recall our previous results; in Section 3 we
begin the analysis of the batch learning algorithm, and introduce some
of the necessary mathematical concepts; in Sections 4-7 we analyze the
three cases stated in Theorem A under the independence hypothesis, and
we summarize our findings in Section 8.



Memoryless Learning and Learning with Full Memory. The general setup is
as follows: There is a collection of concepts $R_0, \dots, R_n$ and
words which refer to these concepts, sometimes ambiguously. The teacher
generates a stream of words, referring to the concept $R_0$. This is
not known to the student, but he must learn by, at each step, guessing
some concept $R_i$ and checking for consistency with the teacher's
input. The memoryless learner algorithm consists of picking a concept
$R_i$ at random, and sticking by this choice until it is proven wrong.
At this point another concept is picked randomly, and the procedure
repeats. Learning with full memory follows the same general process,
with the important difference that once a concept is rejected, the
student never goes back to it. It is clear (for both algorithms) that
once the student hits on the right answer $R_0$, this will be his final
answer. We would like to estimate the probability of having guessed the
right answer after $k$ steps, and also the expected number of steps
before the student settles on the right answer.
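The two procedures just described can be simulated directly. The sketch
below is ours and rests on assumptions not fixed by the text: guesses
are drawn uniformly from all $n + 1$ concepts, the memoryless learner
may redraw a concept it has just rejected, and a word is consistent
with $R_i$ with a given probability `p[i-1]`, independently at each
step. The function names and the `max_steps` cap are hypothetical.

```python
import random


def memoryless_steps(p, rng, max_steps=10**6):
    """Words consumed before a memoryless learner first holds concept 0.

    p[i-1] is the assumed probability that a word for R_0 also fits R_i.
    """
    n = len(p)
    guess = rng.randrange(n + 1)          # uniform initial guess among R_0..R_n
    steps = 0
    while guess != 0 and steps < max_steps:
        steps += 1                        # the teacher emits one word
        if rng.random() >= p[guess - 1]:  # the word contradicts the current guess
            guess = rng.randrange(n + 1)  # no memory: rejected concepts may recur
    return steps


def full_memory_steps(p, rng, max_steps=10**6):
    """Same count for a learner that never revisits a rejected concept."""
    n = len(p)
    order = list(range(n + 1))
    rng.shuffle(order)                    # each concept is tried at most once
    steps = 0
    for guess in order:
        if guess == 0:
            return steps                  # the right answer is never rejected
        while steps < max_steps:
            steps += 1
            if rng.random() >= p[guess - 1]:
                break                     # rejected for good
    return steps


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5] * 20                        # illustrative overlaps only
    trials = 200
    print(sum(memoryless_steps(p, rng) for _ in range(trials)) / trials)
    print(sum(full_memory_steps(p, rng) for _ in range(trials)) / trials)
```

Averaging over many trials, as in the usage lines, gives empirical
estimates of the expected number of steps for a fixed choice of
overlaps.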



Batch Learning. The batch learning situation is similar to the above,
but here the student records the words $w_1, \dots, w_k, \dots$ he gets
from the teacher. For each word $w_i$, we assume that the student can
find (in his textbook, for example) a list $L_i$ of concepts referred
to by the word. If we define

\[ \mathcal{L}_k = \bigcap_{i=1}^k L_i, \]

then we are interested in the smallest value of $k$ such that
$\mathcal{L}_k = \{R_0\}$. This value $k_0$ is the time it has taken
the student to learn the concept $R_0$. We think of $k_0$ as a random
variable, and we wish to estimate its expectation.
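A minimal simulation of the batch learner follows. It is a sketch
under an assumption made only for illustration: each list $L_k$
contains $R_0$ and contains $R_i$ independently with probability
`p[i-1]` (this anticipates the independence hypothesis of Section 1).
The function name and the `max_steps` cap are hypothetical.

```python
import random


def batch_learning_time(p, rng, max_steps=10**6):
    """Smallest k with L_1 & ... & L_k == {R_0}, or max_steps if not reached."""
    n = len(p)
    candidates = set(range(n + 1))   # concepts consistent with every word so far
    k = 0
    while candidates != {0} and k < max_steps:
        k += 1
        # List for the k-th word: R_0 always, R_i with probability p[i-1].
        word_list = {0} | {i for i in range(1, n + 1) if rng.random() < p[i - 1]}
        candidates &= word_list      # running intersection of the lists
    return k


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5] * 20                   # illustrative overlaps only
    trials = 200
    print(sum(batch_learning_time(p, rng) for _ in range(trials)) / trials)
```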

1. The mathematical model 

We think of the words referring to the concept $R_0$ as a probability
space $\mathcal{P}$. The probability that one of these words also
refers to the concept $R_i$ shall be denoted by $p_i$; the probability
that a word refers to concepts $R_{i_1}, \dots, R_{i_k}$ shall be
denoted by $p_{i_1 \dots i_k}$. All the results described below
(obviously) depend in a crucial way on the $p_1, \dots, p_n$ and (in
the case of the batch learning algorithm) also on the joint
probabilities. Since there is no a priori reason to assume specific
values for the probabilities, we shall assume that all of the $p_i$ are
themselves independent, identically distributed random variables. We
shall refer to their common distribution as $\mathcal{F}$, and to the
density as $f$. It turns out that the convergence properties of the
various learning algorithms depend on the local analytic properties of
the distribution $\mathcal{F}$ at 1; a moment's reflection will
convince the reader that this is not really so surprising.

To carry out a precise analysis of the batch learning algorithm, we
will also need the independence hypothesis:

\[ p_{i_1 \dots i_k} = p_{i_1} \cdots p_{i_k}. \]

It is again not too surprising that some such assumption on
correlations ought to be required for precise asymptotic results,
though it is obviously the subject of a (non-mathematical) debate as to
whether assuming that the various concepts are truly independent is
reasonable from a cognitive science point of view.

2. Previous results 



In previous work [KR2001a] and [KR2001b] we obtained the following
result.

Theorem 2.1. Let $N_\Delta$ be the number of steps it takes for the
student (with probability 1) to have probability $1 - \Delta$ of
learning the concept. Then we have the following estimates for
$N_\Delta$:

• if the distribution of overlaps is uniform, or more generally, the
density function satisfies $f(1 - x) = c + O(x^\delta)$ as $x \to 0$,
with $\delta, c > 0$, then
$N_\Delta = |\log \Delta| \, \Theta(n \log n)$ for the memoryless
algorithm and $N_\Delta = (1 - \Delta)^2 \, \Theta(n \log n)$ when
learning with full memory;

• if the probability density function $f(1 - x)$ is asymptotic to
$x^\beta + O(x^{\beta + \delta})$, $\delta, \beta > 0$, as $x$
approaches 0, then for the two algorithms we have respectively
$N_\Delta = |\log \Delta| \, \Theta(n)$ and
$N_\Delta = (1 - \Delta)^2 \, \Theta(n)$;

• if the asymptotic behavior is as above, but $-1 < \beta < 0$, then
$N_\Delta = |\log \Delta| \, \Theta(n^{1/(1+\beta)})$ for the
memoryless learner and
$N_\Delta = (1 - \Delta)^2 \, \Theta(n^{1/(1+\beta)})$ for learning
with full memory.

Recall that $f(x) = \Theta(g(x))$ means that for sufficiently large
$x$, the ratio $f(x)/g(x)$ is bounded between two strictly positive
constants. The distribution of overlaps referred to above is simply the
distribution $\mathcal{F}$. Notice that the theorem says nothing about
the situation when $\mathcal{F}$ is supported in some interval
$[0, a]$, for $a < 1$. That case is (presumably) of scientific
interest, but mathematically it is relatively trivial: we replace the
arguments of all the $\Theta$s above by 1, though, of course, we are
thereby hiding the dependence on $a$.

3. General bounds on the batch learner algorithm 

Consider a set of words $w_1, \dots, w_k$. The probability that they
all refer to the concept $R_i$ is, obviously, $p_i^k$.

Lemma 3.1. The probability $q_k$ that we still have not learned the
concept $R_0$ after $k$ steps is bounded above by $\sum_{i=1}^n p_i^k$
and below by $\max_i p_i^k$.

Proof. Immediate. □
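As a sanity check of Lemma 3.1, the sketch below evaluates $q_k$ in the
special case where the concepts are independent (the hypothesis stated
in Section 1), so that $q_k = 1 - \prod_i (1 - p_i^k)$, and compares it
with the two bounds. The function names and the sample overlaps are
ours, chosen only for illustration.

```python
import math


def not_learned_probability(p, k):
    """q_k under the independence assumption: 1 - prod_i (1 - p_i**k)."""
    return 1.0 - math.prod(1.0 - pi**k for pi in p)


def lemma_3_1_bounds(p, k):
    """Lower bound max_i p_i**k and upper bound sum_i p_i**k of Lemma 3.1."""
    lower = max(pi**k for pi in p)
    upper = sum(pi**k for pi in p)
    return lower, upper


if __name__ == "__main__":
    p = [0.9, 0.5, 0.2]               # illustrative overlaps only
    for k in (1, 5, 20):
        lower, upper = lemma_3_1_bounds(p, k)
        print(k, lower, not_learned_probability(p, k), upper)
```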

We will first use these upper and lower bounds to get corresponding 
bounds on the convergence speed of the batch learner algorithm, and 
then invoke the independence hypothesis to sharpen these bounds in 
many cases. 

We begin with a trivial but useful lemma. 

Lemma 3.2. Let $G$ be a game where the probability of success
(respectively failure) after at most $k$ steps is $s_k$ (respectively
$f_k = 1 - s_k$). Then the expected number of steps until success is

\[ \sum_{k=1}^\infty k (s_k - s_{k-1}) = \sum_{k=0}^\infty (1 - s_k) = \sum_{k=0}^\infty f_k, \]

if the corresponding sum converges.

Proof. The proof is immediate from the definition of expectation and
the possibility of rearrangement of terms of positive series. □
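The identity of Lemma 3.2 can be checked numerically. The sketch below
assumes a game that fails independently with probability $r$ at each
step, so that $s_k = 1 - r^k$ and the expected number of steps is
$1/(1 - r)$; the tail sum is started at $k = 0$, so that the term
$f_0 = 1 - s_0$ is included. The function names and the truncation
level `K` are ours.

```python
def expected_steps_direct(s, K):
    """Truncation of sum_{k>=1} k * (s(k) - s(k-1))."""
    return sum(k * (s(k) - s(k - 1)) for k in range(1, K + 1))


def expected_steps_tail(s, K):
    """Truncation of sum_{k>=0} (1 - s(k)), the sum of failure probabilities."""
    return sum(1.0 - s(k) for k in range(0, K))


if __name__ == "__main__":
    r = 0.7                           # illustrative failure probability per step

    def s(k):
        return 1.0 - r**k             # success within k steps

    print(expected_steps_direct(s, 400), expected_steps_tail(s, 400), 1 / (1 - r))
```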

We can combine Lemma 3.2 and Lemma 3.1 to obtain:

Theorem 3.3. The expected time $T$ of convergence of the batch learner
algorithm is bounded as follows:

\[ \sum_{i=1}^n \frac{1}{1 - p_i} \ge T \ge \max_{1 \le i \le n} \frac{1}{1 - p_i}. \tag{1} \]



The leftmost term in equation (1) has been studied at length in
[KR2001a]. We state a version of the results of [KR2001a] below:



Theorem 3.4. Let $S = \sum_{i=1}^n \frac{1}{1 - p_i}$, where the $p_i$
are independent, identically distributed random variables with values
in $[0, 1]$, with probability density $f$ such that
$f(1 - x) = x^\beta + O(x^{\beta + \delta})$, $\delta > 0$, for
$x \to 0$. If $\beta > 0$, then there exists a mean $m$ such that
$\lim_{n \to \infty} P(|S/n - m| > \epsilon) = 0$ for any
$\epsilon > 0$. If $\beta = 0$, then
$\lim_{n \to \infty} P(|S/(n \log n) - 1| > \epsilon) = 0$. Finally, if
$-1 < \beta < 0$, then
$\lim_{n \to \infty} P(S/n^{1/(\beta+1)} - C > a) = g(a)$, where
$\lim_{a \to \infty} g(a) = 0$ and $C$ is an arbitrary (but fixed)
constant, and likewise

\[ \lim_{n \to \infty} P(S/n^{1/(\beta+1)} < b) = h(b), \]

where $\lim_{b \to 0} h(b) = 0$.

The right hand side of Eq. (1) is easier to understand. Indeed, let
$p_1, \dots, p_n$ be distributed as usual (and as in the statement of
Theorem 3.4). Then



Theorem 3.5. The expected value of $\max_{1 \le i \le n} p_i$ is
asymptotically $1 - C n^{-1/(1+\beta)}$ for some positive constant $C$.

Proof. First, we change variables to $q_i = 1 - p_i$. Obviously, the
statement of the Theorem is equivalent to the statement that
$E = \mathbb{E}(\min_{1 \le i \le n} q_i) \sim C n^{-1/(1+\beta)}$. We
also write $h(x) = f(1 - x)$, and similarly for the primitives $H$ and
$F$. Now, the probability that all of the $q_i$ are greater than some
fixed $y$ equals $(1 - H(y))^n$, so the distribution function of the
minimum is $1 - (1 - H(y))^n$, and

\[ E = \int_0^1 t \, d\left[1 - (1 - H(t))^n\right] = \int_0^1 (1 - H(t))^n \, dt. \]

Perform the change of variables $t = u / n^{1/(1+\beta)}$, to get

\[ E = \frac{1}{n^{1/(1+\beta)}} \int_0^{n^{1/(1+\beta)}} \left(1 - H(u / n^{1/(1+\beta)})\right)^n du. \tag{2} \]

For $u \ll n^{1/(1+\beta)}$, we can write
$H(u / n^{1/(1+\beta)}) \asymp H' u^{\beta+1} / n$, where $H'$ is a
constant. We also know that $H$ is a monotonic function, so if we break
up the integral above as

\[ E \, n^{1/(1+\beta)} = \left( \int_0^{n^{1/(2(1+\beta))}} + \int_{n^{1/(2(1+\beta))}}^{n^{1/(1+\beta)}} \right) \left(1 - H(u / n^{1/(1+\beta)})\right)^n du, \tag{3} \]

we see that the first integral approaches
$C = \int_0^\infty \exp(-H' u^{1+\beta}) \, du$, while the second
integral goes to 0. Note that the proof also evaluates $C$. □

We need one final observation: 

Theorem 3.6. The variable $n^{1/(1+\beta)} \min_{i=1}^n q_i$ has a
limiting distribution with distribution function
$G(x) = 1 - \exp(-H' x^{1+\beta})$, with $H'$ the constant from the
proof of Theorem 3.5.

Proof. Immediate from the proof of Theorem 3.5. □
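For a concrete instance of Theorems 3.5 and 3.6, take the uniform
distribution, where $\beta = 0$, $H(t) = t$ and the constant in
$H(t) \approx H' t^{1+\beta}$ equals 1. Then
$E(\min q_i) = \int_0^1 (1 - t)^n dt = 1/(n+1)$, and the rescaled
minimum $n \min q_i$ has survival function $(1 - x/n)^n \to e^{-x}$.
The sketch below checks both numerically; the function names and the
midpoint-rule grid are ours.

```python
import math


def expected_min(H, n, grid=20_000):
    """Midpoint-rule value of integral_0^1 (1 - H(t))**n dt = E[min q_i]."""
    h = 1.0 / grid
    return h * sum((1.0 - H((j + 0.5) * h)) ** n for j in range(grid))


def scaled_min_survival(H, n, x, beta):
    """P(n**(1/(1+beta)) * min q_i > x) = (1 - H(x / n**(1/(1+beta))))**n."""
    return (1.0 - H(x / n ** (1.0 / (1.0 + beta)))) ** n


if __name__ == "__main__":
    def uniform_H(t):
        return t                      # distribution function of a uniform q_i

    n = 50
    print(expected_min(uniform_H, n), 1.0 / (n + 1))
    print(scaled_min_survival(uniform_H, 10**6, 1.5, 0.0), math.exp(-1.5))
```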

We can now put together all of the above results as follows. 

Theorem 3.7. Let $p_1, \dots, p_n$ be independently distributed with
common density function $f$, such that
$f(1 - x) = c x^\beta + O(x^{\beta + \delta})$, $\delta > 0$. Let $T$
be the expected time of the convergence of the batch learning algorithm
with overlaps $p_1, \dots, p_n$. Then, if $\beta > 0$, there exist
$C_1, C_2$ such that $C_1 n^{1/(1+\beta)} \le T \le C_2 n$, with
probability tending to 1 as $n$ tends to $\infty$. If $\beta = 0$, then
there exist $C_1, C_2$ such that $C_1 n \le T \le C_2 n \log n$, with
probability tending to 1 as $n$ tends to $\infty$. If $\beta < 0$, then
$C_1 n^{1/(1+\beta)} \le T \le C n^{1/(\beta+1)}$ with probability
tending to 1 as $C$ goes to infinity.

The reader will remark that in the case that $\beta < 0$, the upper and
lower bounds have the same order of magnitude as functions of $n$.

4. Independent concepts 

We now invoke the independence hypothesis, whereby an application of
the inclusion-exclusion principle gives us:

Lemma 4.1. The probability $l_k$ that we have learned the concept $R_0$
after $k$ steps is given by

\[ l_k = \prod_{i=1}^n (1 - p_i^k). \]

Note that the probability $s_k$ of winning the game on the $k$-th step
is given by $s_k = l_k - l_{k-1} = (1 - l_{k-1}) - (1 - l_k)$. Since
the expected number of steps $T$ to learn the concept is given by

\[ T = \sum_{k=1}^\infty k s_k, \]

we immediately have (up to an additive constant, which does not affect
the asymptotics)

\[ T = \sum_{k=1}^\infty (1 - l_k). \]

Lemma 4.2. The expected time $T$ of learning the concept $R_0$ is given
by

\[ T = \sum_{k=1}^\infty \left( 1 - \prod_{i=1}^n (1 - p_i^k) \right). \]

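Since the sum above is absolutely convergent, we can expand the
products and interchange the order of summation to get the following
formula for $T$:

\[ T = \sum_{\emptyset \ne s \subseteq \{1, \dots, n\}} (-1)^{|s|-1} \sum_{k=1}^\infty p_s^k = \sum_{\emptyset \ne s \subseteq \{1, \dots, n\}} (-1)^{|s|-1} \left( \frac{1}{1 - p_s} - 1 \right), \tag{4} \]

where we have identified subsets of $\{1, \dots, n\}$ with the
corresponding multi-indices, so that $p_s = \prod_{i \in s} p_i$ by the
independence hypothesis.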
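The passage from the series of Lemma 4.2 to the inclusion-exclusion
formula (4) can be verified numerically for a small number of concepts.
The sketch below is ours: it truncates the series at a hypothetical
level `K` and sums formula (4) over all nonempty subsets, with $p_s$
the product of the $p_i$ for $i$ in $s$.

```python
import math
from itertools import combinations


def T_series(p, K=2000):
    """Truncation of sum_{k>=1} (1 - prod_i (1 - p_i**k)) from Lemma 4.2."""
    return sum(1.0 - math.prod(1.0 - pi**k for pi in p) for k in range(1, K + 1))


def T_inclusion_exclusion(p):
    """Formula (4): sum over nonempty subsets s of (-1)**(|s|-1) * (1/(1-p_s) - 1)."""
    total = 0.0
    for r in range(1, len(p) + 1):
        for subset in combinations(p, r):
            p_s = math.prod(subset)   # joint probability under independence
            total += (-1) ** (r - 1) * (1.0 / (1.0 - p_s) - 1.0)
    return total


if __name__ == "__main__":
    p = [0.9, 0.6, 0.3]               # illustrative overlaps only
    print(T_series(p), T_inclusion_exclusion(p))
```

The subset sum has $2^n - 1$ terms, so this direct evaluation is only
practical for small $n$; the series form is the one suited to
computation.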

The formula (4) is useful in and of itself, but we now use it to
attempt to get the expectation of the expected time of success $T$
under our distribution and independence assumption. For this we shall
need the following:

Definition 4.3. Let $\mathcal{F}$ be a probability distribution on an
interval $I$, and let $m_k(\mathcal{F}) = \int_I x^k \mathcal{F}(dx)$
be the $k$-th moment of $\mathcal{F}$. Then the moment zeta function of
$\mathcal{F}$ is defined to be

\[ \zeta_{\mathcal{F}}(s) = \sum_{k=1}^\infty m_k^s(\mathcal{F}), \]

whenever the sum is defined.

Lemma 4.4. Let $\mathcal{F}$ be a probability distribution as above,
and let $x_1, \dots, x_n$ be independent random variables with common
distribution $\mathcal{F}$. Then

\[ \mathbb{E}\left( \frac{1}{1 - x_1 \cdots x_n} - 1 \right) = \zeta_{\mathcal{F}}(n). \tag{5} \]

In particular, the expectation is undefined whenever the zeta function
is undefined.

Proof. Expand the fraction in a geometric series and apply Fubini's
theorem. □

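Example 4.5. For $\mathcal{F}$ the uniform distribution on $[0, 1]$, we
have $m_k(\mathcal{F}) = 1/(k+1)$, so that
$\zeta_{\mathcal{F}}(s) = \zeta(s) - 1$, where $\zeta$ is the familiar
Riemann zeta function. Notice that this is not defined for $n = 1$;
this will be important in the sequel.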
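The uniform example can be checked by summing the series directly. The
sketch below assumes that the sum defining the moment zeta function
starts at $k = 1$, in which case the uniform moments $1/(k+1)$ give
$\zeta(s) - 1$; at $s = 2$ this is $\pi^2/6 - 1$. At $s = 1$ the
partial sums grow logarithmically, which reflects the divergence noted
above. The function names and truncation levels are ours.

```python
import math


def moment_zeta(moment, s, K=200_000):
    """Partial sum of zeta_F(s) = sum_{k>=1} m_k**s, truncated at K terms."""
    return sum(moment(k) ** s for k in range(1, K + 1))


def uniform_moment(k):
    """k-th moment of the uniform distribution on [0, 1]."""
    return 1.0 / (k + 1)


if __name__ == "__main__":
    print(moment_zeta(uniform_moment, 2), math.pi**2 / 6 - 1)
    # Partial sums at s = 1 keep growing with the truncation level.
    print(moment_zeta(uniform_moment, 1, K=10**2), moment_zeta(uniform_moment, 1, K=10**4))
```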

It should be noted that in the case we are interested in (distributions
supported in $[0, 1]$), the asymptotics of the moments are determined
by the local properties of the distribution at 1, up to exponentially
decreasing error terms. So, if $f(1 - x) \asymp x^\beta$ (recall that
$f$ is the density), we see that the $k$-th moment of $\mathcal{F}$ is
asymptotic to $C k^{-(1+\beta)}$, for some constant $C$. To show this,
we first define the Mellin transform of $f$ to be

\[ \mathcal{M}(f)(s) = \int_0^1 f(x) x^{s-1} \, dx. \]

We see that $m_k(\mathcal{F}) = \mathcal{M}(f)(k + 1)$. The Mellin
transform is very closely related to the Laplace transform. Indeed,
making the substitution $x = \exp(-u)$, we see that

\[ \mathcal{M}(f)(s) = \int_0^\infty f(\exp(-u)) \exp(-su) \, du, \]

so the Mellin transform of $f$ is equal to the Laplace transform of
$u \mapsto f(\exp(-u))$.

Now, the asymptotics of the Laplace transform are easily computed by
Laplace's method, and in the case we are interested in, Watson's lemma
(see, e.g., [BenOrsz]) tells us that if $f(x) \asymp c (1 - x)^\beta$,
then $\mathcal{M}(f)(s) \asymp c \, \Gamma(\beta + 1) s^{-(\beta+1)}$.
In particular, $\zeta_{\mathcal{F}}(s)$ is defined for
$s > 1/(1 + \beta)$. Below we shall analyze three cases (though the
analysis is almost the same in the three cases, there are some
important variations). In the sequel, we set $\alpha = \beta + 1$.
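The moment asymptotics can be illustrated on a family where everything
is explicit. The sketch below assumes the density
$f(x) = (\beta + 1)(1 - x)^\beta$ on $[0, 1]$ (our choice, not the
paper's), whose $k$-th moment is
$\Gamma(k+1)\Gamma(\beta+2)/\Gamma(k+\beta+2)$, and compares it with
the Watson-type prediction $c \, \Gamma(\beta+1) s^{-(\beta+1)}$ at
$s = k + 1$ with $c = \beta + 1$. The function names are ours.

```python
import math


def beta_moment(k, beta):
    """k-th moment of the density f(x) = (beta + 1) * (1 - x)**beta on [0, 1]."""
    log_moment = math.lgamma(k + 1) + math.lgamma(beta + 2) - math.lgamma(k + beta + 2)
    return math.exp(log_moment)


def watson_prediction(k, beta):
    """c * Gamma(beta + 1) * s**-(beta + 1) with c = beta + 1 and s = k + 1."""
    return math.gamma(beta + 2) * (k + 1) ** (-(beta + 1))


if __name__ == "__main__":
    for beta in (-0.5, 0.0, 0.5):
        k = 10_000
        print(beta, beta_moment(k, beta) / watson_prediction(k, beta))
```

The printed ratios should be close to 1, with a relative error of order
$1/k$.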

5. $\alpha > 1$

In this case, we use our assumptions (taking expectations over the
$p_i$ by means of Lemma 4.4) to rewrite Eq. (4) as

\[ T = -\sum_{k=1}^n (-1)^k \binom{n}{k} \zeta_{\mathcal{F}}(k). \tag{6} \]

This, in turn, can be rewritten (by expanding the definition of zeta)
as

\[ T = -\sum_{j=1}^\infty \left[ (1 - m_j(\mathcal{F}))^n - 1 \right]. \tag{7} \]

Since the term in the sum is monotonically decreasing, the sum in Eq.
(7) can be approximated by an integral (of any monotonic interpolation
$m$ of the sequence $m_k(\mathcal{F})$; however, there is no reason not
to set $m(x) = \mathcal{M}(f)(x + 1)$), with error bounded by the first
term, which is, in turn, bounded in absolute value by 2, to get

\[ T = -\int_1^\infty \left[ (1 - m(x))^n - 1 \right] dx + O(1), \]

where the error term is bounded above by 2.

Now, let us assume that $m(x)$ is of order $x^{-\alpha}$ for some
$\alpha > 1$. We substitute $x = n^{1/\alpha} / u$, to get

\[ T = n^{1/\alpha} \int_0^{n^{1/\alpha}} \frac{1 - (1 - m(n^{1/\alpha}/u))^n}{u^2} \, du + O(1) \]

\[ = n^{1/\alpha} \int_0^{n^{1/\alpha}} \frac{1 - (1 - m'(u) u^\alpha / n)^n}{u^2} \, du + O(1) \]

\[ = n^{1/\alpha} \left( \int_0^{n^{1/(2\alpha)}} + \int_{n^{1/(2\alpha)}}^{n^{1/\alpha}} \right) \frac{1 - (1 - m'(u) u^\alpha / n)^n}{u^2} \, du + O(1), \]

where $m'$ is a bounded (asymptotically constant) function. In the
second integral the integrand is bounded above by $1/u^2$, so the
contribution from that integral goes to 0, while in the first integral
we can approximate $(1 - m' u^\alpha / n)^n$ by $\exp(-m' u^\alpha)$,
and the contribution from that integral goes to

\[ T = n^{1/\alpha} \int_0^\infty \frac{1 - \exp(-m'(u) u^\alpha)}{u^2} \, du + O(1) \asymp n^{1/\alpha}. \tag{10} \]
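The growth rate $n^{1/\alpha}$ can be observed numerically on an
explicit example. The sketch below assumes the density
$f(x) = 2(1 - x)$ (our choice), for which $\beta = 1$, $\alpha = 2$ and
$m_j = 2/((j+1)(j+2))$, and evaluates a truncation of the series
$\sum_{j \ge 1} [1 - (1 - m_j)^n]$ of Eq. (7). Quadrupling $n$ should
then roughly double the sum. The function names and the truncation
level `J` are ours.

```python
import math


def expected_T(n, moment, J=200_000):
    """Truncation of Eq. (7): sum_{j>=1} [1 - (1 - m_j)**n]."""
    total = 0.0
    for j in range(1, J + 1):
        # 1 - (1 - m)**n computed stably for small m via log1p and expm1.
        total += -math.expm1(n * math.log1p(-moment(j)))
    return total


def moment_alpha2(j):
    """j-th moment of the density f(x) = 2 * (1 - x) on [0, 1]."""
    return 2.0 / ((j + 1) * (j + 2))


if __name__ == "__main__":
    for n in (400, 1600):
        value = expected_T(n, moment_alpha2)
        print(n, value, value / math.sqrt(n))
```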



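6. $\alpha = 1$

In this case, $f(x) = c + o(1)$ as $x$ approaches 1. It is not hard to
see that $\zeta_{\mathcal{F}}(n)$ is defined for $n \ge 2$. We break up
the expression in Eq. (4) as

\[ T = \sum_{j=1}^n \left( \frac{1}{1 - p_j} - 1 \right) + \sum_{s \subseteq \{1, \dots, n\}, \ |s| > 1} (-1)^{|s|-1} \left( \frac{1}{1 - p_s} - 1 \right). \tag{11} \]

Let

\[ T_1 = \sum_{j=1}^n \left( \frac{1}{1 - p_j} - 1 \right), \]

\[ T_2 = \sum_{s \subseteq \{1, \dots, n\}, \ |s| > 1} (-1)^{|s|-1} \left( \frac{1}{1 - p_s} - 1 \right). \]

The first sum $T_1$ has no expectation; however, $T_1/n$ does have a
stable distribution centered on $c \log n + c_2$. We will keep this in
mind, but now let us look at the second sum $T_2$. It can be rewritten
as

\[ T_2 = -\sum_{j=1}^\infty \left[ (1 - m_j(\mathcal{F}))^n - 1 + n m_j(\mathcal{F}) \right]. \tag{12} \]

The same method as in Section 5, under the assumption that the $k$-th
moment is asymptotic to $k^{-\alpha}$ (this time for $\alpha \le 1$),
can be used to write

\[ T_2 = n \int_0^n \frac{1 - n m(n/u) - (1 - m(n/u))^n}{u^2} \, du + O(1) \]

\[ = n \int_0^n \frac{1 - m'(u) u - (1 - m'(u) u / n)^n}{u^2} \, du + O(1). \tag{13} \]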




The conclusion differs somewhat from that of Section 5 in that we get
an additional term of $c n \log n$, where
$c = \lim_{x \to 1} f(x) = \lim_{j \to \infty} j m_j$. This term is
equal (with opposing sign) to the center of the stable law satisfied by
$T_1$, so in case $\alpha = 1$, we see that $T$ has no expectation but
satisfies a law of large numbers, of the following form:

Theorem 6.1 (Law of large numbers). There exists a constant $C$ such
that $\lim_{n \to \infty} P(|T/n - C| > y) = 0$ for every $y > 0$.

7. $\alpha < 1$

In this case the analysis goes through as in the preceding section
when $\alpha > 1/2$, but then runs into considerable difficulties.
However, in this case we note that Theorem 3.7 actually gives us tight
bounds.



8. The inevitable comparison 

We are now in a position to compare the performance of the batch
learning algorithm with that of the memoryless learning algorithm and
of learning with full memory, as summarized in Theorem 2.1. We combine
our computations above with the observation that the batch learner
algorithm converges geometrically (Lemma 4.1), to get:

Theorem 8.1. Let $N_\Delta$ be the number of steps it takes for the
student (with probability 1) to have probability $1 - \Delta$ of
learning the concept using the batch learner algorithm. Then we have
the following estimates for $N_\Delta$:

• If the distribution of overlaps is uniform, or more generally, the
density function satisfies $f(1 - x) = c + O(x^\delta)$ as $x \to 0$,
with $\delta, c > 0$, then $N_\Delta = |\log \Delta| \, \Theta(n)$;

• If the probability density function $f(1 - x)$ is asymptotic to
$x^\beta + O(x^{\beta + \delta})$, $\delta, \beta > 0$, as $x$
approaches 0, then we have
$N_\Delta = |\log \Delta| \, \Theta(n^{1/(1+\beta)})$;

• If the asymptotic behavior is as above, but $-1 < \beta < 0$, then
$N_\Delta = |\log \Delta| \, \Theta(n^{1/(1+\beta)})$.



Comparing Theorems 2.1 and 8.1, we see that the batch learning
algorithm is uniformly superior for $\beta \ge 0$, and the only one of
the three to achieve sublinear performance whenever $\beta > 0$ (the
other two never do better than linearly, unless the distribution
$\mathcal{F}$ is supported away from 1). On the other hand, for
$\beta < 0$, the batch learning algorithm performs comparably to the
memoryless learner algorithm, and worse than learning with full memory.
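The comparison can be tabulated with a small helper that returns the
order of magnitude of $N_\Delta$ stated in Theorems 2.1 and 8.1, with
all constants hidden in the $\Theta$ dropped. This is only a
bookkeeping sketch: the function name and the algorithm labels are
ours, and the values are meaningful only up to constant factors.

```python
import math


def steps_order(beta, n, delta, algorithm):
    """Order of magnitude of N_Delta per Theorems 2.1 and 8.1 (constants dropped)."""
    if beta <= -1:
        raise ValueError("the theorems assume beta > -1")
    power = n ** (1.0 / (1.0 + beta))
    if algorithm == "batch":
        return abs(math.log(delta)) * power
    # The memoryless and full-memory learners share the same growth in n.
    if beta > 0:
        in_n = n
    elif beta == 0:
        in_n = n * math.log(n)
    else:
        in_n = power
    if algorithm == "memoryless":
        return abs(math.log(delta)) * in_n
    if algorithm == "full_memory":
        return (1.0 - delta) ** 2 * in_n
    raise ValueError(f"unknown algorithm: {algorithm}")


if __name__ == "__main__":
    for beta in (-0.5, 0.0, 1.0):
        row = [steps_order(beta, 1000, 0.1, a)
               for a in ("memoryless", "full_memory", "batch")]
        print(beta, row)
```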